Even after successful delivery, a message may still fail
during playback to the service. Such failures typically abort the
playback transaction, which causes the message to return to the service
queue. WCF will then detect the message in the queue and retry. If the
next call fails too, the message will go back to the queue again, and so
on. Continuously retrying this way is often unacceptable. If the initial
motivation for the queued service was load leveling, WCF’s auto-retry
behavior will generate considerable stress on the service. You need a
smart failure-handling scheme that deals with the case when the call
never succeeds (and, of course, defines “never” in practical terms). The
failure handling will determine after how many attempts to give up,
after how long to give up, and even the interval at which to try.
Different systems need different retry strategies and have different
sensitivity to the additional thrashing and probability of success. For
example, retrying 10 times with a single retry once every hour is not
the same strategy as retrying 10 times at 1-minute intervals, or the
same as retrying 5 times, with each attempt consisting of a batch of 2
successive retries separated by a day. In general, it is better to hedge
your bets on the causes for the failure and the probability of future
success by retrying in a series of batches, to deal with sporadic and
intermittent infrastructure issues as well as fluctuating application
state. A series of batches, each batch consisting of a set number of
retries in rapid succession, may just be able to catch the system in a
state that will allow the call to succeed. If it doesn’t, deferring some
of the retries to a future batch allows the system some time to
recuperate. Additionally, once you have given up on retries, what should
you do with the failed message, and what should you acknowledge to its
sender?
1. Poison Messages
Transactional messaging systems are inherently
susceptible to repeated failure, because retry thrashing can
bring the system to its knees. Messages that continuously fail
playback are referred to as poison messages,
because they literally poison the system with futile retries.
Transactional messaging systems must actively detect and eliminate
poison messages. Since there is no telling whether just one more retry
might actually succeed, you can use the following simple heuristic:
all things being equal, the more the message fails, the higher the
likelihood is of it failing again. For example, if the message has
failed just once, retrying seems reasonable. But if the message has
already failed 1,000 times, it is very likely it will fail again the
1,001st time, so it is pointless to try again. In this case, the
message should be deemed a poison message. What exactly constitutes
“pointless” (or just wasteful) is obviously application-specific, but
it is a configurable decision. MsmqBindingBase
offers a number of properties governing the handling of playback
failures:
public abstract class MsmqBindingBase : Binding,...
{
   //Poison message handling
   public int ReceiveRetryCount
   {get;set;}
   public int MaxRetryCycles
   {get;set;}
   public TimeSpan RetryCycleDelay
   {get;set;}
   public ReceiveErrorHandling ReceiveErrorHandling
   {get;set;}
   //More members
}
2. Poison Message Handling in MSMQ 4.0
With MSMQ 4.0 (available on Windows Vista, Windows Server 2008,
and Windows 7 or later), WCF retries playing back a failed message in
a series of batches, for the reasoning just presented. WCF provides each
queued endpoint with a retry queue and an optional poison messages
queue. After all the calls in the batch have failed, the message does
not return to the endpoint queue. Instead, it goes to the retry queue
(WCF will create that queue on the fly). Once the message is deemed
poisonous, you may have WCF move that message to the poison
queue.
2.1. Retry batches
In each batch, WCF will immediately retry up to ReceiveRetryCount times
after the first call failure. ReceiveRetryCount
defaults to five retries, or a total of six attempts, including the
first attempt. After a batch has failed, the message goes to the
retry queue. After a delay of RetryCycleDelay (a TimeSpan), the message is
moved from the retry queue to the endpoint queue for another retry
batch. The retry delay defaults to 30 minutes. Once that batch
fails, the message goes back to the retry queue, where it will be
tried again after the delay has expired. Obviously, this cannot go
on indefinitely. The MaxRetryCycles property controls how many
retry cycles to attempt at most. MaxRetryCycles defaults to only two cycles,
resulting in three batches in total. After MaxRetryCycles retry cycles have failed, the
message is considered a poison message.
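Taken at face value, the rules above reduce to simple arithmetic. The following minimal sketch works it out. RetryMath and its methods are hypothetical helpers made up for illustration (they are not WCF types), and the sketch assumes that the immediate retries inside a batch take negligible time.

```csharp
using System;

// Hypothetical helper, not part of WCF: it only restates the retry rules
// described in the text as arithmetic.
public static class RetryMath
{
    // A batch is the first attempt plus ReceiveRetryCount immediate retries.
    public static int AttemptsPerBatch(int receiveRetryCount)
    {
        return receiveRetryCount + 1;
    }

    // The first batch plus MaxRetryCycles additional batches.
    public static int TotalBatches(int maxRetryCycles)
    {
        return maxRetryCycles + 1;
    }

    // Total number of playback attempts before the message is deemed poisonous.
    public static int TotalAttempts(int receiveRetryCount, int maxRetryCycles)
    {
        return AttemptsPerBatch(receiveRetryCount) * TotalBatches(maxRetryCycles);
    }

    // Only the delays between batches are counted here (assumption: the
    // retries inside a batch are effectively instantaneous).
    public static TimeSpan MinimumTimeToPoison(int maxRetryCycles, TimeSpan retryCycleDelay)
    {
        return TimeSpan.FromTicks(retryCycleDelay.Ticks * maxRetryCycles);
    }
}
```

With the default values (five retries, two cycles, a 30-minute delay), RetryMath.TotalAttempts(5, 2) evaluates to 18 attempts, spread over at least one hour.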
When configuring nondefault values for MaxRetryCycles, I recommend setting its
value in direct proportion to RetryCycleDelay. The reason is that the
longer the delay is, the more tolerant your system will be of
additional retry batches, because the overall stress will be
somewhat mitigated (having been spread over a longer period of
time). With a short RetryCycleDelay you should minimize the
number of allowed batches, because you are trying to avoid
approximating continuous thrashing.
Finally, the ReceiveErrorHandling property governs what
to do after the last retry fails and the message is deemed
poisonous. The property is of the enum type ReceiveErrorHandling, defined as:
public enum ReceiveErrorHandling
{
   Fault,
   Drop,
   Reject,
   Move
}
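These four settings can also be assigned in code, since NetMsmqBinding derives from MsmqBindingBase. The following is a minimal sketch, assuming the System.ServiceModel assembly is referenced. The factory class name and the values chosen are illustrative assumptions, not recommendations.

```csharp
using System;
using System.ServiceModel;

// Hypothetical factory, shown only to illustrate setting the properties in code.
public static class PoisonHandlingBindingFactory
{
    public static NetMsmqBinding Create()
    {
        return new NetMsmqBinding
        {
            // Two immediate retries after the first failure in each batch
            ReceiveRetryCount = 2,
            // Wait five minutes in the retry queue between batches
            RetryCycleDelay = TimeSpan.FromMinutes(5),
            // Two more batches after the first one
            MaxRetryCycles = 2,
            // What to do once the message is deemed poisonous
            ReceiveErrorHandling = ReceiveErrorHandling.Move
        };
    }
}
```

The binding returned by Create() would then be passed to the host when adding the queued endpoint, in place of a named binding configuration in the config file.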
2.2. ReceiveErrorHandling.Fault
The Fault value
treats the poison message as a catastrophic failure and actively
faults the MSMQ channel and the service host. Doing so prevents the
service from processing any other messages, be they from a queued
client or a regular connected client. The poison message will remain
in the endpoint queue and must be removed from it explicitly by the
administrator or by some compensating logic, since WCF will refuse
to process it again if you merely restart the host. In order to
continue processing client calls of any sort, you must open a new
host (after you have removed the poison message from the queue).
While you could install an error-handling extension to do some of that work, in practice
there is no avoiding involving the application administrator.
ReceiveErrorHandling.Fault
is the default value of the ReceiveErrorHandling property. With this
setting, no acknowledgment of any sort is sent to the sender of the
poison message. ReceiveErrorHandling.Fault is both the
most conservative poison message strategy and the least useful from
the system perspective, since it amounts to a stalemate.
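As a rough sketch of what such an error-handling extension might look like, the handler below only detects and logs the poison message. It assumes WCF surfaces the failure to error handlers as an MsmqPoisonMessageException. The class name PoisonMessageLogger is hypothetical, and installing the handler on the channel dispatchers (through a service behavior) is not shown.

```csharp
using System;
using System.Diagnostics;
using System.ServiceModel;
using System.ServiceModel.Channels;
using System.ServiceModel.Dispatcher;

// Hypothetical error handler: it records which message was poisonous so that
// an administrator (or compensating logic) can remove it from the queue.
public class PoisonMessageLogger : IErrorHandler
{
    // Queued calls are one-way, so there is no fault to return to the client.
    public void ProvideFault(Exception error, MessageVersion version, ref Message fault)
    {}

    public bool HandleError(Exception error)
    {
        MsmqPoisonMessageException poison = error as MsmqPoisonMessageException;
        if (poison == null)
        {
            return false; // Not a poison message; let other handlers deal with it
        }
        // The lookup ID identifies the offending message in the endpoint queue.
        Trace.TraceError("Poison message detected, lookup ID: {0}", poison.MessageLookupId);
        return true;
    }
}
```

Logging the lookup ID does not unblock the service by itself: the message still has to be removed from the queue and a new host opened.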
2.3. ReceiveErrorHandling.Drop
The Drop value, as
its name implies, silently ignores the poison message by dropping it
and having the service keep processing other messages. You should
configure for ReceiveErrorHandling.Drop if you have high tolerance for
both errors and retries. If the message is not crucial (i.e., it is
used to invoke a nice-to-have operation), dropping and continuing is
acceptable. In addition, while ReceiveErrorHandling.Drop does allow for
retries, conceptually you should not have too many retries—if you
care that much about the message succeeding, you should not just
drop it after the last failure.
Configuring for ReceiveErrorHandling.Drop also sends an
ACK to the sender, so from the sender’s perspective, the message was
delivered and processed successfully. For many applications,
ReceiveErrorHandling.Drop is an
adequate choice.
2.4. ReceiveErrorHandling.Reject
The ReceiveErrorHandling.Reject value actively
rejects the poison message and refuses to have anything to do with
it. Similar to ReceiveErrorHandling.Drop, it drops the
message, but it also sends a NACK to the sender, thus signaling
ultimate delivery and processing failure. The sender responds by
moving the message to the sender’s dead-letter queue. ReceiveErrorHandling.Reject is a
consistent, defensive, and adequate option for the vast majority of
applications (yet it is not the default, to accommodate MSMQ 3.0
systems as well).
2.5. ReceiveErrorHandling.Move
The ReceiveErrorHandling.Move value is the
advanced option for services that wish to defer judgment on the failed message to a
dedicated third party. ReceiveErrorHandling.Move moves the message to the
dedicated poison messages queue, and it does not send back an ACK or
a NACK. Acknowledging processing of the message will be done after
it is processed from the poison messages queue. While ReceiveErrorHandling.Move is a great
choice if indeed you have some additional error recovery or
compensation workflow to execute in case of a poison message, a
relatively small set of applications will find it useful, due to
its increased complexity and intimate integration with the
system.
2.6. Configuration sample
Example 1
shows a configuration section from a host config file, configuring
poison message handling on MSMQ 4.0.
Example 1. Poison message handling on MSMQ 4.0
<bindings>
   <netMsmqBinding>
      <binding name = "PoisonMessageHandling"
         receiveRetryCount    = "2"
         retryCycleDelay      = "00:05:00"
         maxRetryCycles       = "2"
         receiveErrorHandling = "Move"
      />
   </netMsmqBinding>
</bindings>
Figure 1
illustrates graphically the resulting behavior in the case of a
poison message.
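The sequence of events implied by such a configuration can also be traced in a few lines of code. This is a simulation sketch only: PoisonTimeline is a hypothetical class (not part of WCF), the wording of the events is made up, and the attempts inside a batch are assumed to take no time.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical simulation of the life of a poison message under MSMQ 4.0.
public static class PoisonTimeline
{
    public static List<string> Trace(int receiveRetryCount, int maxRetryCycles,
                                     TimeSpan retryCycleDelay, string finalAction)
    {
        List<string> events = new List<string>();
        TimeSpan clock = TimeSpan.Zero;
        int attemptsPerBatch = receiveRetryCount + 1; // First attempt plus retries

        for (int batch = 0; batch <= maxRetryCycles; batch++)
        {
            events.Add(String.Format("{0}: batch {1}, {2} attempts fail",
                                     clock, batch + 1, attemptsPerBatch));
            if (batch < maxRetryCycles)
            {
                // More cycles are allowed: park the message in the retry queue
                events.Add(String.Format("{0}: message moved to retry queue", clock));
                clock += retryCycleDelay;
                events.Add(String.Format("{0}: message returned to endpoint queue", clock));
            }
        }
        events.Add(String.Format("{0}: message deemed poisonous, action: {1}",
                                 clock, finalAction));
        return events;
    }
}
```

Calling PoisonTimeline.Trace(2, 2, TimeSpan.FromMinutes(5), "Move") with the values of Example 1 yields three batches of three attempts each, at 00:00:00, 00:05:00, and 00:10:00, after which the message is moved to the poison queue.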
2.7. Poison message service
Your service can provide a dedicated poison message–handling
service to handle messages posted to its poison messages queue when
the binding is configured with ReceiveErrorHandling.Move. The poison message service must
be polymorphic with the service’s queued endpoint contract. WCF will
retrieve the poison message from the poison queue and play it to the
poison service. It is therefore important that the poison service
does not throw unhandled exceptions or abort the playback
transaction. Such a poison
message service typically engages in some kind of compensating work
associated with the failed message, such as refunding a customer for
a missing item in the inventory. Alternatively, a poison service
could do any number of things, including notifying the
administrator, logging the error, or just ignoring the message
altogether by simply returning.
The poison message service is developed and configured like
any other queued service. The only difference is that the endpoint
address must be the same as the original endpoint address, suffixed
by ;poison. Example 2 demonstrates the
required configuration of a service and its poison message service.
In Example 2, the
service and its poison message service share the same host process,
but that is certainly optional.
Example 2. Configuring a poison message service
<system.serviceModel>
   <services>
      <service name = "MyService">
         <endpoint
            address  = "net.msmq://localhost/private/MyServiceQueue"
            binding  = "netMsmqBinding"
            bindingConfiguration = "PoisonMessageSettings"
            contract = "IMyContract"
         />
      </service>
      <service name = "MyPoisonServiceMessageHandler">
         <endpoint
            address  = "net.msmq://localhost/private/MyServiceQueue;poison"
            binding  = "netMsmqBinding"
            contract = "IMyContract"
         />
      </service>
   </services>
   <bindings>
      <netMsmqBinding>
         <binding name = "PoisonMessageSettings"
            receiveRetryCount    = "..."
            retryCycleDelay      = "..."
            maxRetryCycles       = "..."
            receiveErrorHandling = "Move"
         />
      </netMsmqBinding>
   </bindings>
</system.serviceModel>
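On the code side, the poison message service could look roughly like the following sketch. The operation MyMethod and its orderId parameter are assumptions (the text does not show the members of IMyContract), and Compensate() is a hypothetical placeholder for whatever compensating work applies.

```csharp
using System;
using System.Diagnostics;
using System.ServiceModel;

[ServiceContract]
public interface IMyContract
{
    // Hypothetical operation; queued operations must be one-way
    [OperationContract(IsOneWay = true)]
    void MyMethod(string orderId);
}

// Implements the same contract as the original service, so WCF can play the
// poison message to it from the ;poison queue.
[ServiceBehavior(InstanceContextMode = InstanceContextMode.PerCall)]
public class MyPoisonServiceMessageHandler : IMyContract
{
    [OperationBehavior(TransactionScopeRequired = true)]
    public void MyMethod(string orderId)
    {
        try
        {
            Compensate(orderId);
        }
        catch (Exception exception)
        {
            // Do not rethrow: an unhandled exception would abort the playback
            // transaction of the poison service itself.
            Trace.TraceError("Compensation for {0} failed: {1}", orderId, exception.Message);
        }
    }

    // Hypothetical compensating logic, such as refunding the customer
    protected virtual void Compensate(string orderId)
    {}
}
```

Catching every exception is a deliberate choice here: the poison service is the last stop for the message, so a failure in compensation is logged rather than propagated.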
You
should avoid having two or more service endpoints monitoring the
same queue, since this will result in them processing each other’s
messages. You may be tempted, however, to leverage this behavior
as load balancing of sorts: you can deploy your service on
multiple machines while having all the machines share the same
queue. The problem in this scenario is poison message handling. It
is possible for one service to return the message to the queue
after an error for a retry, and then have a second service start
processing that message, not knowing it was already tried or how
many times. I believe the fundamental problem here is not with
load balancing queued calls and playback errors; rather, it is
with the need to load balance the queued calls in the first place.
Load balancing is done in the interest of scalability and
throughput. Both scalability and throughput have a temporal
quality—they imply a time constraint on the level of performance
of the service, and yet queued calls, by their very nature,
indicate that the client does not care exactly when the calls
execute.
Nonetheless, to enable multiple services to share a queue
and manage playback errors, WCF defines the helper class
ReceiveContext,
defined as:
public abstract class ReceiveContext
{
   public virtual void Abandon(TimeSpan timeout);
   public virtual void Complete(TimeSpan timeout);
   public static bool TryGet(Message message,
                             out ReceiveContext property);
   public static bool TryGet(MessageProperties properties,
                             out ReceiveContext property);
   //More members
}
You enable the use of ReceiveContext with the ReceiveContextEnabled attribute:
public sealed class ReceiveContextEnabledAttribute : Attribute,
                                                     IOperationBehavior
{
   public bool ManualControl
   {get;set;}
   //More members
}
After a failure, you can use ReceiveContext to lock the message in
the queue and prevent other services from processing it. This,
however, results in a cumbersome programming model that is nowhere
near as elegant as the transaction-driven queued services. I
recommend you design the system so that you do not need to load
balance your queued services and that you avoid ReceiveContext altogether.
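For completeness, the following sketch gives an idea of what that programming model involves when ManualControl is set to true. The contract, the operation, the Process() method, and the 30-second timeout are all assumptions made for illustration, and transaction-related configuration is omitted.

```csharp
using System;
using System.ServiceModel;
using System.ServiceModel.Channels;

[ServiceContract]
public interface IMyContract
{
    // Hypothetical one-way operation
    [OperationContract(IsOneWay = true)]
    void MyMethod(string orderId);
}

public class MyService : IMyContract
{
    static readonly TimeSpan Timeout = TimeSpan.FromSeconds(30); // Arbitrary value

    [ReceiveContextEnabled(ManualControl = true)]
    public void MyMethod(string orderId)
    {
        ReceiveContext receiveContext;
        if (!ReceiveContext.TryGet(OperationContext.Current.IncomingMessageProperties,
                                   out receiveContext))
        {
            throw new InvalidOperationException("No ReceiveContext on this message");
        }

        bool processed;
        try
        {
            Process(orderId);
            processed = true;
        }
        catch (Exception)
        {
            processed = false;
        }

        if (processed)
        {
            receiveContext.Complete(Timeout); // Done with the message for good
        }
        else
        {
            receiveContext.Abandon(Timeout);  // Release the message for another attempt
        }
    }

    // Hypothetical business logic
    protected virtual void Process(string orderId)
    {}
}
```

Every operation has to carry this acknowledgment code explicitly, which is the bookkeeping that the transaction-driven model otherwise handles on its own.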
3. Poison Message Handling in MSMQ 3.0
With MSMQ 3.0 (available on Windows XP and Windows Server 2003),
there is no retry queue or optional poison queue. As a result, WCF
supports at most a single retry batch out of the original endpoint
queue. After the last failure of the first batch, the message is
considered poisonous. WCF therefore behaves as if MaxRetryCycles is
always set to 0, and the value of RetryCycleDelay is ignored. The only values
available for the ReceiveErrorHandling property are ReceiveErrorHandling.Fault and
ReceiveErrorHandling.Drop. Configuring any other value throws
an InvalidOperationException at
service load time.